Abstract

The codons, sixty-four in number, are distributed over the coding parts of DNA sequences.
The distribution function is the plot of frequency versus rank of the codons. These distributions
are characterised by parameters that are almost universal, i.e., gene independent. Only a small
part depends on the gene. We present the theory to calculate the universal (gene-independent)
part. The part that is gene-specific, however, has undetermined overlaps and fluctuations.
1 Introduction

The methods of statistical linguistics have been used in recent years to study DNA sequences[1].
The genome projects generate large volumes of data on DNA. Fast and reliable computational tools
are required to analyse these data, which run to billions of bases. The idea is to identify features
in the sequences and to correlate them with known biological functions. The methods of statistical
linguistics [2] could provide reliable computational algorithms. This is what we investigate here.

The sequences are made of the nucleotide bases A, C, G and T. The arrangement of the bases over 
the linear chain determines all the information there is in DNA. The regions that code for proteins, 
the coding regions (or the exons), have bases working in groups of three to make proteins. These 
triplets are called codons. The biologically meaningful words are these codons. The noncoding parts 
consist of the introns and the flanks. These are presumed important in regulatory and promotional 
activities. The biologically meaningful word structures in these regions are not known. A gene
generally comprises a number of exon regions separated by introns. Since the biological functions
thus far are associated with the triplet codons, we concern ourselves only with these triplet words, the 
codons. Therefore, in our analysis, instead of an entire gene, we consider the coding DNA sequence 
(CDS) region of the gene, where the exon segments are put together, splicing the introns out. 

Natural languages are characterised by structures determined by rules of grammar. The words 
put together with these rules carry sense. The rules give coherence and meaning to long texts. The 
languages have this long-range order. The frequency spectra show the presence of the long periods.
These are identified by the $1/f^{\beta}$ type behaviour in the low frequency region [3]. Words placed
at random will have a quite different frequency spectrum with no long-range behaviour. The early work
on natural languages dealing with the statistical distributions of words, done by Zipf [4], assigned
ranks to the words. The most frequent word has rank = 1; the next most frequent has rank = 2 and so on.
Zipf showed that for natural languages the plot of frequency, $f_n$, versus rank, $n$, is of the
power-law form:

$$f_n = \frac{f_1}{n^{\alpha}} \qquad (1)$$

where $f_1$ is the frequency of rank 1. In Zipf's original analysis the power-index $\alpha$ was
assumed to be one. Subsequent studies have allowed for deviations from one.

The DNA sequence of the letters A, C, G and T does have a $1/f^{\beta}$ frequency spectrum[5]. It is
possible, therefore, that the sequences have long-range order and underlying grammar rules. The opinion
on this issue remains divided[6]. Some have taken the view that DNA is language-like [7]. In the coding
regions the long periods have lower incidence than in the non-coding parts. The Zipf-type fits in
DNA regions (with overlapping n-tuples) have shown that the index $\alpha$ is higher in the non-coding
segments than in the coding ones. The $\alpha$ averaged over several overlapping n-tuples is nearer to
the value for natural languages for the non-coding segments than for the coding ones[1,7].

The body of evidence presented in support of the language-like features of DNA has remained 
ambiguous[8]. For one, it is not known how the power-law Zipf behaviour of natural languages is
connected to the long-range correlations [9]. It is known, for instance, that pseudorandom sequences
satisfy Zipf behaviour. Further, it is known that the frequencies of A, C, G and T vary somewhat
more for the introns and the flanks than for the exons[10]. The "long-range" order that is observed for
these noncoding regions may be an outcome of the frequency differences. The higher value of the 
Zipf index for the noncoding segments may again be ascribed to these differences in the frequencies 
of the bases. 

The importance of statistical linguistics as a computational tool remains insufficiently explored for 
DNA sequences. While the Zipf law is probably not connected to the deeper features of languages 
such as the universal grammar, the coherence and the long periods, it could still be useful. For 
instance, the index $\alpha$ of languages could be (and is) used in computer algorithms to identify authors.
The texts generated by authors vary slightly in their Zipf index. The index, therefore, identifies the 
author. Could one use similar algorithms to identify regions from the genome segments and relate 
them to their biological functions? 

As precision and reliability are important we have weighed the merits of power-law fits against
exponential fits. Since we are solely concerned with non-overlapping 3-tuples (i.e. the codons), we
find the exponential fits have consistently lower $\chi^2$. [Chi-square ($\chi^2$) is the sum of the
ratio of the squared difference between the observed value at the $i$th point ($o_i$) and the expected
value at the $i$th point ($e_i$) to the expected value at the $i$th point ($e_i$), i.e.,
$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$, where the sum over $i$ runs over the number of points of
the fit. The value of $\chi^2$ depends on the total number of points to be fit minus one, sometimes
called the degrees of freedom, df.] The exponentials, therefore, provide better fits. That the
power-law fits for DNA sequences are worse than the exponentials has also been observed by
others [11]. The power law of Zipf is characterised by two parameters, the index $\alpha$ and the
frequency of rank one, i.e. $f_1$. The number of parameters for the exponential fit is of interest to
us. Zipf's law is used to find the relationship connecting vocabulary to the text-length. Such a
connection does exist for the exponential fit as well.
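As a minimal sketch of the chi-square measure described above (the sum over fitted points of the squared difference between observed and expected values, divided by the expected value), the following Python helper may be useful. The function names are illustrative and are not taken from the original work.

```python
def chi_square(observed, expected):
    """Return sum_i (o_i - e_i)^2 / e_i over the fitted points."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(num_points):
    """df as described in the text: number of fitted points minus one."""
    return num_points - 1
```

With the same observed frequencies, the model (power law or exponential) that yields the smaller value is the better fit in the sense used here.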

The parameters of the exponential rank-frequency relation depend crucially on the text-length.
Once this parameter is known, the approximate length of the segment is known as well. Indeed,
the exponential fits are largely determined by two quantities, the frequency of rank 1, i.e., $f_1$,
and the text-length of the sequence. There is however a small part that is characteristic of the gene.
This signature of the gene is potentially useful in generating algorithms to identify the gene and
relate it to the biological functions.

2 The Approach 

Out of the four bases A, C, G and T we have 4 x 4 x 4 = 64 possible triplets. Three combinations,
namely TAA, TAG and TGA, are the stop codons. Thus 64 - 3 = 61 is the meaningful vocabulary.
The most frequent codon has rank n = 1, the next most frequent has n = 2 and so on. We define the
frequency, $f$, of a particular codon as the number of times it appears in the sequence. [Note this
definition is different from some of the references where
$f_n = \frac{\text{Number of words of rank } n}{\text{Total number of words}}$.] The frequency of
rank $n$ is $f_n$.

Here both frequency ($f$) and rank ($n$) are dimensionless.

Observations on the CDS reveal that many codons may have the same frequency. Note that the 
CDS we are dealing with are relatively short sequences of several hundred to several thousand bases. 
This problem of multiple codons having the same frequency is called frequency degeneracy. 

First, as we consider only codons, 61 in number, the problem of saturation of vocabulary for 
large text-length is clear. However, for most genes we observe that the actual usage of codons is 
smaller than 61. The codon usage is sometimes referred to as the vocabulary, i.e. the total number 
of different codons, used in the CDS. 
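The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_codons` and the flag `skip_stop_codons` are hypothetical, the toy sequence is not a real CDS, and the text does not say whether the terminating stop codon is counted, so the sketch leaves that choice to the caller. Codons with equal frequency are placed on consecutive ranks; the alphabetical tie-break is an arbitrary choice.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def rank_codons(cds, skip_stop_codons=False):
    """Return [(codon, frequency)] sorted by decreasing frequency.

    The entry at list position k has rank n = k + 1.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial triplet
    # Non-overlapping triplets, read in frame from the first base.
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if skip_stop_codons:
        codons = [c for c in codons if c not in STOP_CODONS]
    counts = Counter(codons)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_codons("ATGGCCGCCAAGGCCTAA")  # toy sequence, not a real CDS
frequencies = [freq for _, freq in ranked]
text_length = sum(frequencies)  # total number of triplets counted
vocabulary = len(ranked)        # number of distinct codons = maximum rank
```

In the toy sequence three codons share the frequency one, which is the frequency degeneracy mentioned above.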

From Zipf's law [equation (1)] with $\alpha = 1$ we have

$$\ln(f_n) = \ln(f_1) - \ln(n)$$

If we plot $\ln(f_n)$ vs $\ln(n)$ we have a straight line with slope -1 and intercept on the y-axis
at $\ln(f_1)$. Clearly, the maximum rank is just equal to $f_1$. When $\alpha$ deviates from 1, $f_1$
and the maximum rank are connected to each other through $\alpha$. The maximum rank (i.e. the
vocabulary) along with $f_1$ (or $\alpha$) determines the text-length $l$, i.e., the total number of
triplets, as follows:

$$l = f_1 + f_2 + f_3 + \dots + f_{n_{max}}$$

where $n_{max}$ is the maximum rank. Thus, $\alpha$ may be thought of as a function of $f_1$ and the
text-length $l$. We want to arrive at the corresponding relation for our exponential fits.
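To make the Zipf relation between $f_1$, the index $\alpha$ and the text-length concrete, here is a small sketch. It assumes, as in the discussion above, that the maximum rank is the largest rank whose model frequency $f_1/n^{\alpha}$ is still at least one; the function name and the use of non-integer model frequencies are illustrative choices.

```python
def zipf_text_length(f1, alpha=1.0):
    """Sum the Zipf frequencies f_n = f1 / n**alpha up to the maximum rank."""
    # Largest n with f1 / n**alpha >= 1; the small offset guards against
    # floating-point round-off when f1**(1/alpha) is an exact integer.
    n_max = int(f1 ** (1.0 / alpha) + 1e-9)
    return sum(f1 / n ** alpha for n in range(1, n_max + 1))
```

For $\alpha = 1$ the maximum rank equals $f_1$, so for example $f_1 = 4$ gives $l = 4(1 + 1/2 + 1/3 + 1/4)$.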

3 The Exponential Fit 

All the degenerate frequencies are assigned different rank numbers. Thus if CCG and CAG have the
same frequency of occurrence they belong to two different ranks (one following the other) in our
work. Therefore, here too, the codon usage, maximum rank and vocabulary are synonymous. The
exponential function that connects frequency to rank is

$$f_n = f_1 \exp\{-\beta (n - 1)\} \qquad (2)$$

where $\beta$, a dimensionless constant for a particular gene, is to be determined from the fit.

We have tried this fit function on over 300 CDS. The CDS are sourced from the EMBL[12] and
GenBank[13] databases. Table 1 gives the values of $\beta$ for some of the sequences under study.
A plot showing the fit is given in figure (1).
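One way to estimate $\beta$ in the exponential form $f_n = f_1 \exp\{-\beta (n - 1)\}$ is sketched below. The fitting procedure actually used for Table 1 is not described in the text, so this is an assumption: $f_1$ is held at its observed value and $\beta$ is obtained by least squares on $\ln(f_n/f_1)$ against $n - 1$, a straight line through the origin. The function names are hypothetical.

```python
import math


def fit_beta(frequencies):
    """Least-squares beta for f_n = f_1 * exp(-beta * (n - 1)).

    `frequencies` must be positive and sorted in decreasing order,
    rank 1 first. With x = n - 1 and y = ln(f_n / f_1) the model is
    y = -beta * x, so beta = -sum(x * y) / sum(x * x).
    """
    f1 = frequencies[0]
    xs = [n - 1 for n in range(1, len(frequencies) + 1)]
    ys = [math.log(f / f1) for f in frequencies]
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("need at least two ranks to fit beta")
    return -sum(x * y for x, y in zip(xs, ys)) / sxx


def model_frequencies(f1, beta, n_max):
    """Frequencies predicted by the exponential form for ranks 1..n_max."""
    return [f1 * math.exp(-beta * (n - 1)) for n in range(1, n_max + 1)]
```

A fit on real codon counts will differ somewhat from a chi-square minimisation, because the logarithm weights the low-frequency ranks more heavily.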

The index $\beta$ in the exponential of equation (2) takes different values for different genes. It
turns out, however, that $\beta$ is not a completely free parameter. Indeed, from Table 1, we notice
that CDS whose text-lengths $l$ and $f_1$ are close have similar, though not identical, $\beta$ values.
Notice, for instance, that the β-globin CDS from the chicken and the clawed frog have the same $l$ and
$f_1$, 147 and 9 respectively; whereas the lysozyme CDS from the fish Cyprinus carpio has $l = 146$
and $f_1 = 9$. The $\beta$ values for the β-globin CDS of the chicken and the frog are 0.05773 and
0.05772; while the lysozyme CDS, though functionally quite unrelated to the β-globin, has a $\beta$
value of 0.06056. So the value of $\beta$ is determined to a considerable extent by $f_1$ and the
text-length of the sequence, $l$. Only a part of $\beta$ is characteristic of the gene.

4 Plot of $\beta$ vs. $f_1$

Figure (2) gives plots of $\beta$ vs $f_1$ for four complete CDS coding for the α-globin, β-globin,
phosphoglycerate kinase and globulin proteins. The $\chi^2$ values indicate that the relationship
between $\beta$ and $f_1$ is linear to a good approximation. The plot for each CDS involves data on
the gene from different species. These are sourced from GenBank. Each of the linear plots is specific
to the gene. The evolution of the genes, as we move higher in the evolutionary hierarchy, does not
significantly alter the overall text-length of the CDS regions.

The slopes for the two globin CDS, α and β, are nearly equal. As we show in the subsequent
pages, the value of $\beta$ is considerably determined by $f_1$ and $l$. Only a small part is unique
to the gene. For the case of the α and the β globins notice that the text-lengths of these CDS vary
in a small range between 143 and 147. Table 1 shows that any two quite unrelated CDS can have
$\beta$ values that are close provided their text-lengths and $f_1$ are nearly equal.

The plots in figure (3) of $\beta$ vs $f_1$ keep the text-length $l$ fixed at 140 for the same four
genes. Though the closeness in the values of the slopes indeed shows the influence of $l$ on the
$\beta$ value, the small differences indicate the presence of an $l$-independent part in the
$\beta$ value.

That the $\beta$ values are not completely determined by $f_1$ and $l$, but do have a component,
albeit small, coming from the genes is illustrated in our next plot, figure (4). A number of different
CDS, each from a different organism, were chosen and cut at three different text-lengths, 30, 140 and
300, i.e., we considered only the first 30, 140 and 300 triplets respectively out of the whole CDS. The
plot of $\beta$ vs $f_1$ for these three text-lengths indicates that when the text-length is held
fixed, but the genes are varied, the exponential gives a better fit than the linear. It is noteworthy
that even though the genes are unrelated as far as their biological functions are concerned, the codon
distributions, described by the exponential fit of figure (4), are not completely unrelated.

Taken together, the two plots, figure (3) and figure (4), tell us:

(i) When the text-length, $l$, is held fixed, and the genes are not varied, the plot of $\beta$ vs
$f_1$ is linear, and

(ii) When the text-length, $l$, is held fixed, and the genes are varied, the plot of $\beta$ vs $f_1$
is exponential.

Thus, we conclude that the value of $\beta$ does have a part that is gene specific.
5 Plot of $\beta$ vs $l$

$\beta$, as we have observed from Table 1, depends on $f_1$ and $l$. Beyond that there is the part
that is gene specific. In other words the parameters of the functional fit do depend, in a small way,
on the gene. This dependence we discuss later. Here, in this section, we concern ourselves with the
dependence of $\beta$ on the text-length of the CDS.

We plot $\beta$ vs $l$ keeping $f_1$ fixed. The plots in figure (5) show the dependence for four
different values of $f_1$, namely $f_1 = 7$, $f_1 = 9$, $f_1 = 20$ and $f_1 = 38$.

In plotting figure (5) we considered the $f_1$ values of the natural CDS. We had the option to cut
the CDS into fragments to suit our value of $f_1$. This procedure turned out to be arbitrary, as the
$f_1$ value may remain fixed over some hundred bases. Cutting into fragments is nonunique. It was,
therefore, difficult to restrict our study of $\beta$ vs $l$ to a particular gene. For a specific CDS
(from different species) the text-length does not vary significantly in most cases. Therefore, for a
fixed value of $f_1$, the CDS were searched over different genes. Thus $f_1$ is held fixed, but the
genes vary.

Though more data for each gene could have improved the result, the relationship between $\beta$ and
$l$ for fixed $f_1$ has a linear trend. As the text-length increases, $\beta$ decreases. However,
the plots for different values of $f_1$ are not parallel. They depend on $f_1$. The slope reaches a
maximum at around $f_1 = 10$ and tends to decrease as we go away from $f_1 = 10$ on either side. For
large values of $f_1$, the plots tend to become parallel.
6 Theory of $\beta$

We have seen that $\beta$ depends on the text-length, $l$, and the frequency of rank 1, $f_1$.

(1) When the text-length $l$ is held fixed, genes not varied, $\beta$ depends linearly on $f_1$. The
plot of $\beta$ vs $f_1$ shows that the slope is positive.

(2) When the text-length is kept fixed, but the genes are varied, the plot of $\beta$ vs $f_1$ shows
deviations from linearity. An exponential fit appears more appropriate.

(3) When $f_1$ is held fixed (genes are varied as well) the plot of $\beta$ vs $l$ shows an
approximately linear behaviour; the slope is negative. Note that, because of the points mentioned
earlier, the variations in $l$ (in figure 5) are over a rather small range. As a result the full
$l$-dependence is not clear from figure (5).

In this section we investigate $\beta$ theoretically. Let us denote the maximum rank by $n_{max}$.
Since the frequency of $n_{max}$ is almost always one, we get

$$1 = f_1 \exp\{-\beta (n_{max} - 1)\} \qquad (3)$$

Or,

$$n_{max} = \frac{\ln(f_1)}{\beta} + 1 \qquad (4)$$

The text-length $l$ is just the sum over all the frequencies. Thus,

$$l = \sum_{n=1}^{n_{max}} f_1 e^{-\beta (n - 1)} \qquad (5)$$

$$l \simeq \frac{f_1 \left(1 - e^{-\beta (n_{max} - 1)}\right)}{1 - e^{-\beta}} \qquad (6)$$

Substituting for $n_{max}$ from equation (4), we get

$$l = \frac{f_1 - 1}{1 - e^{-\beta}} \qquad (7)$$

Thus,

$$\beta = -\ln\left[1 - \frac{1}{l}(f_1 - 1)\right] \qquad (8)$$

Since the quantity $\frac{f_1 - 1}{l}$ is small compared to one, we get, to the first approximation,

$$\beta = \frac{f_1 - 1}{l} + \text{higher orders} \qquad (9)$$



Equation (9) tells us

(i) $\beta$ vs $f_1$, when $l$ is kept fixed, is linear; the slope is positive.

(ii) $\beta$ vs $l$, with $f_1$ fixed, is hyperbolic. If the text-length variation is small we expect
an approximately linear relation with negative slope (as observed in figure (5)). How good the
relation (9) is can be checked in Table 1.

While the relation (9) tells us that $\beta$ is entirely determined by the ratio of $f_1 - 1$ to $l$,
figure (3) tells us that this quantity does have a characteristic dependence on the gene family. We
conclude, therefore, that the relation (9) does not determine $\beta$ entirely. There is a part that is
gene specific. The theoretical values of $\beta$, equation (9), are reasonably close to the values
obtained from the CDS. The dependence of $\beta$ on $f_1$ and $l$ in equation (9) is gene-independent.
It is the universal part of $\beta$. The deviation from this universal part, even though small, is
established in figure (3) and figure (4).
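The closed-form relations of this section, $\beta = -\ln[1 - (f_1 - 1)/l]$ from equation (8), its first-order form $(f_1 - 1)/l$ from equation (9), and $n_{max} = \ln(f_1)/\beta + 1$ from equation (4), are easy to evaluate numerically. The sketch below uses hypothetical function names; the values $l = 143$ and $f_1 = 10$ are those quoted for the duck α-globin CDS in the legend of figure 1.

```python
import math


def beta_exact(l, f1):
    """Equation (8): beta = -ln[1 - (f1 - 1) / l]."""
    return -math.log(1.0 - (f1 - 1) / l)


def beta_first_order(l, f1):
    """Equation (9) with the higher-order terms dropped."""
    return (f1 - 1) / l


def n_max(beta, f1):
    """Equation (4): rank at which the model frequency falls to one."""
    return math.log(f1) / beta + 1.0


b8 = beta_exact(143, 10)
b9 = beta_first_order(143, 10)
```

The first-order value lies slightly below the full logarithm, since every higher-order term in the expansion is positive.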
We define the quantity that gives a measure of this deviation as the ratio of the fitted $\beta$ to
its theoretical value:

$$\frac{\beta}{\beta_{th}} \qquad (10)$$

where $\beta_{th} = \left[\frac{f_1 - 1}{l} + \frac{1}{2}\frac{(f_1 - 1)^2}{l^2}\right]$.

We have retained the first two orders in $\frac{f_1 - 1}{l}$ [of equation (8)]. This is to make sure
the higher orders in $\frac{f_1 - 1}{l}$ do not account for the deviations. The values of
$\beta/\beta_{th}$ appear in the last column of Table 1.
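As an illustrative check of the deviation measure, the sketch below evaluates the two-term theoretical value $\beta_{th} = y + y^2/2$ with $y = (f_1 - 1)/l$, and the ratio of the fitted $\beta$ to it. The function names are hypothetical; the numbers are the Table 1 entries for the Ark Clam α-globin CDS ($l = 151$, $f_1 = 7$, $\beta = 0.04221$).

```python
def beta_th(l, f1):
    """First two orders of the expansion of beta in y = (f1 - 1) / l."""
    y = (f1 - 1) / l
    return y + 0.5 * y * y


def deviation_ratio(beta_fit, l, f1):
    """Ratio of the fitted beta to its theoretical, gene-independent value."""
    return beta_fit / beta_th(l, f1)


ark_clam_th = beta_th(151, 7)
ark_clam_ratio = deviation_ratio(0.04221, 151, 7)
```

A ratio close to one means the fitted $\beta$ is almost entirely accounted for by $f_1$ and $l$.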



7 Evolution

We get back to Table 1 for the CDS of α-globin, β-globin, insulin and globulin. We notice the
value of $f_1$ increases as we walk up the ladder of evolution. The increase in $f_1$ increases
$\beta$, while the text-length of the CDS does not change significantly in evolution. The results for
the insulin and the globulin CDS [Table 1] carry at least one exception. Interestingly, for both these
CDS, the exceptional species is the same, the rabbit. The rabbit has $f_1$ and $\beta$ values greater
than the human for these two CDS. The number of exceptions increases for the two globins. Some fishes
show greater $f_1$ (and hence $\beta$) values than the amphibian species, the African clawed frog. If
we average $\beta$ for the mammals we find it always exceeds that of the other groups.

On the other hand, if we compare the $\beta/\beta_{th}$ values for each of these four CDS, α-globin
and globulin do not show any clear pattern. In insulin, the values increase as we move from fish to
mammals through amphibia. But the Syrian hamster CDS is found to have a lower $\beta/\beta_{th}$ than
the clawed frog CDS. Besides, the rat has a greater $\beta/\beta_{th}$ compared to the human. In
β-globin, the Atlantic salmon stands as an exception. Otherwise, the value increases from amphibia and
bird to mammals. But here the representatives of amphibia and bird have the same value, and the lemur
exceeds the value of the human. We conclude that the value of $\beta/\beta_{th}$, though independent
of $l$ and $f_1$, is less species specific; whereas the value of $\beta$ does have evolutionary content.

8 Gene-Specific Signatures 

In figure (2) we showed that $\beta$ vs $f_1$ is a straight line when the genes are not varied. When
the genes are varied, but the text-length is held constant, the relationship of $\beta$ to $f_1$ is no
longer linear. The exponential fit is appropriate for this case. This led us to conclude that there is
a part of $\beta$ that is gene-specific.

In figure (3) we plotted $\beta$ vs $f_1$ keeping the genes fixed for different organisms. The slope
is a characteristic of the gene. There is a variation in the slope as we go from one gene to another.

The regular, namely exponential, form obtained in figure (4) in the plot of $\beta$ vs $f_1$, $l$ being
kept constant, tells us that the variation of $\beta$, as we go from one gene to another, is orderly.

$\beta$ has a part that is gene independent. We isolate this universal component of $\beta$
theoretically. This part comes out to be a function of the text-length of the sequence and the
frequency of rank 1, i.e. $f_1$. The ratio $\beta/\beta_{th}$, defined in equation (10), measures the
deviation of the actual $\beta$ from this universal, gene-independent contribution given in
equation (10). If the gene-specific features are not dominant, $\beta/\beta_{th}$ should be close to
one. Table 1 gives us the values of $\beta/\beta_{th}$. Clearly, the gene-specific components in
$\beta$ could be as high as 40% (as in insulin). We are led to conclude that the methods of
statistical linguistics, of the Zipf variety, have potential in algorithms to identify genes from the
databases.

The quantity $\beta/\beta_{th}$ that isolates the gene-specific components of $\beta$ is however not
unique to genes. Observations on $\beta/\beta_{th}$ (Table 1) show that the ranges of variation in
$\beta/\beta_{th}$ do overlap for different genes. There continue to be undetermined fluctuations in
the values of $\beta/\beta_{th}$. Work is currently in progress to isolate the unique gene-identifying
signatures in the Zipf approach.
Figure Legends

Figure 1. The plots of frequency ($f$) vs. rank ($n$) are the exponential functions (equation 2). Here
different codons with the same frequency of occurrence are given consecutive ranks. The data
correspond to the α-globin CDS from Duck (Acc. No. J00923). The $\beta$ value comes out to be 0.06801.
The text-length, $l$, of the CDS is 143; $f_1$ is 10.

Figure 2. $\beta$ is plotted as a function of $f_1$ for the natural CDS of 4 different proteins from
various species. The relationship turns out to be linear.

symbol  CDS                        range of l   m        c         sd
*       α-globin :                 142-151      0.0083   -0.0136   0.0029
o       β-globin :                 146-149      0.0092   -0.0258   0.0014
A       phosphoglycerate kinase :  417-418      0.0031   -0.0169   0.0008
V       Globulin :                 399-413      0.0036   -0.0277   0.0022

[Keys: m → slope; c → constant; sd → standard deviation]
Figure 3. The text-length ($l$) is kept fixed at 140 to plot $\beta$ as a function of $f_1$ for the
CDS of the same 4 proteins as in figure 2. The best fit here is a linear one.

symbol  CDS                        m        c         sd
*       α-globin :                 0.0080   -0.0093   0.0015
o       β-globin :                 0.0095   -0.0239   0.0013
A       phosphoglycerate kinase :  0.0094   -0.0167   0.0029
V       Globulin :                 0.0097   -0.0250   0.0007

[Keys: m → slope; c → constant; sd → standard deviation]
Figure 4. $\beta$ is plotted as a function of $f_1$ at 3 different values of $l$. Here a number of
different CDS from various species are chosen and cut at 3 text-lengths, 30, 140 and 300. For
text-lengths 30 and 140, 15 CDS were chosen (GenBank accession numbers are AF007570, L37416, M16024,
AF053332, AF001310, M15387, V00410, M15052, L47295, X07083, M59772, J05118, AF056080, AF170848 and
M64656), while for text-length 300, 13 CDS were chosen (GenBank accession numbers are U02504,
AF000953, M73993, AF054895, AF076528, AF053332, M15052, U65090, Z54364, U53218, AB013732,
M15668 and U69698). Unlike figure 2 and figure 3, the exponential gives a better fit than the linear.
The fit function: $Y = Y_0 + A\,e^{x/t}$.

symbol  l     Y0       A        t
*       30    0.0236   0.0357   2.7704
o       140   0.0324   0.0481   12.8086
A       300   0.0018   0.0133   12.4689
Figure 5. $\beta$ is plotted as a function of $l$ for 4 different values of $f_1$. For each $f_1$,
natural CDS with that particular $f_1$ are considered. The relationship between $\beta$ and $l$ for
fixed $f_1$ comes out to be linear.

symbol  f1   m               c        sd
*       7    -4.84 × 10^-4   0.1154   6.89 × 10^-4
o       9    -8.54 × 10^-4   0.1841   0.0021
A       20   -1.63 × 10^-4   0.1133   7.14 × 10^-4
V       38   -1.33 × 10^-4   0.1458   8.85 × 10^-4

[Keys: m → slope; c → constant; sd → standard deviation]
Table 1: The $\beta$ values for some CDS from different organisms. Here $l$ and $f_1$ stand for
the total number of triplet codons and the frequency of the most frequent codon
respectively. The $\chi^2$ value signifies how good the fit is, and the degrees of freedom,
denoted by df, is simply one less than the total number of ranks. $\beta_{th}$ and the ratio
$\beta/\beta_{th}$ are explained in equation (10).

Protein                   Organism             Accession no.  l     f1   β        χ²      df  β_th     β/β_th
α-globin                  Ark Clam             X71386         151   7    0.04221  0.137   52  0.0405   1.0415
                          Rainbow Trout        D88114         144   9    0.05893  0.202   43  0.0571   1.0321
                          Cyprinus carpio      AB004739       144   10   0.06890  0.450   45  0.0645   1.0691
                          Black Rockcod        AF049916       144   11   0.07649  0.594   41  0.0719   1.0646
                          Duck                 J00923         143   10   0.06801  0.105   40  0.0645   1.0551
                          Pigeon               X56349         143   10   0.06872  0.155   40  0.0649   1.0584
                          Chicken              V00410         142   10   0.07251  0.893   46  0.0654   1.1089
                          House Mouse          V00714         142   9    0.06037  0.192   45  0.0579   1.0421
                          Rhesus Monkey        J004495        143   10   0.06568  0.353   37  0.0649   1.0117
                          Rabbit               M11113         143   10   0.06661  0.188   38  0.0649   1.0260
                          Norway Rat           U62315         143   10   0.06897  0.386   43  0.0649   1.0624
                          Otolemur             M29648         143   13   0.09286  0.727   38  0.0874   1.0620
                          Grevy's Zebra        U70191         143   13   0.09678  0.272   40  0.0874   1.1068
                          Human                V00488         143   14   0.10045  0.007   35  0.0950   1.0569
                          Orangutan            M12157         143   15   0.11022  0.487   37  0.1027   1.0732
                          Horse                M17902         143   15   0.11385  0.399   40  0.1027   1.1086
                          Sheep                X70215         143   17   0.13269  1.153   38  0.1182   1.1231
                          Goat                 J00043         143   17   0.13675  1.432   41  0.1182   1.1574
                          Salamander           M13365         144   9    0.06240  0.489   51  0.0571   1.0928
                          Clawed Frog          X14260         142   10   0.07394  0.411   48  0.0654   1.1308
β-globin                  Atlantic Salmon      X69958         149   11   0.07382  0.543   43  0.0694   1.0643
                          Clawed Frog          Y00501         147   9    0.05772  0.196   45  0.0559   1.0326
                          Chicken              V00409         147   9    0.05773  0.324   46  0.0559   1.0327
                          House Mouse          V00722         147   8    0.05075  0.099   46  0.0488   1.0410
                          Rabbit               V00882         146   9    0.06091  0.133   46  0.0563   1.0817
                          Rat                  X06701         147   10   0.06849  0.545   43  0.0631   1.0856
                          Opossum              J03643         148   12   0.08164  2.183   45  0.0771   1.0592
                          Sheep                X14727         146   12   0.08413  0.351   39  0.0782   1.0761
                          Goat                 M15387         146   13   0.09558  0.406   42  0.0856   1.1170
                          Lemur                M15734         148   14   0.10743  1.375   42  0.0917   1.1715
                          Human                AF007546       148   15   0.11245  1.530   39  0.0991   1.1349
Insulin                   Salmon               J00936         106   7    0.06425  0.490   45  0.0582   1.1040
                          Clawed Frog          M24443         107   8    0.07922  0.841   46  0.0676   1.1726
                          Syrian Hamster       M26328         111   9    0.08656  0.703   42  0.0747   1.1592
                          Guinea Pig           K02233         111   9    0.09220  0.815   45  0.0747   1.2348
                          Owl Monkey           J02989         109   13   0.14189  1.667   39  0.1162   1.2216
                          Octodon degus        M57671         110   12   0.14122  1.322   44  0.1050   1.345
                          Rat                  J00747         111   12   0.14785  2.192   44  0.1040   1.4216
                          Human                J00265         111   13   0.17379  2.795   42  0.1240   1.4012
                          Rabbit               U03610         111   18   0.21253  2.940   32  0.1648   1.2890
Globulin                  Pig                  AF204929       413   18   0.03901  0.860   58  0.0420   0.9286
                          Bovine               AF204928       412   19   0.04173  1.227   57  0.0446   0.9348
                          Djungarian Hamster   U16673         400   25   0.06195  5.871   59  0.0618   1.0024
                          Norway Rat           NM_012650      404   26   0.06505  7.256   59  0.0638   1.0196
                          House Mouse          NM_011367      404   28   0.07215  9.484   58  0.0691   1.0447
                          Human                NM_001040      403   33   0.09463  18.202  60  0.1112   0.8511
                          Rabbit               AF144711       399   39   0.12568  19.189  60  0.0998   1.2596
Heat shock protein 70     Babesia microti      U53448         646   35   0.05127  0.867   55  0.0540   0.9491
                          Pacific Oyster       AF144646       660   36   0.05235  1.576   58  0.0544   0.9616
                          Human                U56725         640   40   0.06454  3.140   59  0.0628   1.0277
                          Mouse                L27086         642   38   0.06131  2.627   60  0.0593   1.0341
                          Chinook Salmon       U35064         645   42   0.06640  1.533   60  0.06559  1.0124
                          Rat                  L16764         642   48   0.07369  6.523   40  0.0759   0.9710
Phosphorylase kinase      Human                X80497         1236  51   0.03709  10.391  61  0.0413   0.8990
                          Rabbit               X60421         1236  58   0.04458  7.694   61  0.0472   0.9449
                          Mouse                X74616         1242  47   0.03244  8.927   61  0.0377   0.8598
Glycogen synthase         Human                J04501         738   44   0.05968  6.984   60  0.0599   0.9952
                          Mouse                U53218         739   37   0.04718  7.113   60  0.0499   0.9455
                          Rabbit               AF017114       736   49   0.06603  3.001   59  0.0674   0.9804
                          Rat                  J05446         704   28   0.03483  1.945   60  0.0391   0.8910
Troponin C                Chicken              M16024         162   17   0.12374  1.577   45  0.1037   1.1938
                          Human                M22307         161   23   0.19581  3.333   40  0.1460   1.3413
                          Mouse                M57590         161   21   0.17806  4.565   42  0.1319   1.3496
                          Rabbit               J03462         161   24   0.19294  3.964   36  0.1531   1.2606
                          Clawed Frog          AB003080       162   16   0.12250  1.370   47  0.0969   1.2645
Albumin                   Bovine               M73993         608   38   0.06437  9.754   59  0.0627   1.0265
                          Human                NM_001133      600   34   0.05643  9.235   58  0.0565   0.9986
                          Clawed Frog          M18350         607   41   0.06845  15.699  56  0.0681   1.0056
Lysozyme                  Anopheles gambiae    U28809         141   11   0.08073  0.561   45  0.0734   1.0993
                          Bovine               M95099         148   7    0.04359  0.094   51  0.0414   1.0539
                          Cyprinus carpio      AB027305       146   9    0.06056  0.390   47  0.0563   1.0757
                          Human                M19045         149   7    0.04341  0.122   52  0.0411   1.0567
                          Pig                  U44435         149   8    0.04946  0.503   51  0.0481   1.0287
Lactate dehydrogenase     Alligator            L79952         334   16   0.05460  0.441   58  0.0459   1.1890
                          Cyprinus carpio      AF076528       334   23   0.0708   2.166   53  0.0680   1.0401
                          Human                U13680         333   20   0.05961  3.075   57  0.0587   1.0157
                          Pig                  U95378         333   19   0.05461  2.347   57  0.0555   0.9838
                          Pigeon               L79957         334   19   0.05536  2.110   56  0.0553   1.0003
                          Clawed Frog          AF070953       333   20   0.05831  2.010   53  0.0586   0.9935
Phosphoglycerate kinase   Candida albicans     U25180         418   34   0.08126  2.388   38  0.0821   0.9901
                          Leishmania major     L25120         418   34   0.08677  1.132   56  0.0821   1.0573
                          Mouse                M15668         418   23   0.05298  1.155   58  0.0540   0.9807
                          Rat                  M31788         418   23   0.05374  1.825   60  0.0540   0.9948
                          Schistosoma mansoni  L36833         417   29   0.07284  5.498   60  0.0694   1.0494
Carboxypeptidase A        Aedes aegypti        AF165923       428   20   0.04373  1.785   61  0.0454   0.9636
                          Bovine               M61851         420   22   0.05170  0.417   59  0.0512   1.0088
                          Human                M27717         418   20   0.04477  1.128   59  0.0465   0.9630
                          Mouse                J05118         418   23   0.05124  6.547   58  0.0540   0.9485